Take-home Exercise 1: Geospatial Analytics for Social Good

The Task

The specific tasks of this take-home exercise are as follows:

  • Using appropriate sf method, import the shapefile into R and save it in a simple feature data frame format. Note that there are three Projected Coordinate Systems of Nigeria, they are: EPSG: 26391, 26392, and 26303. We can use any one of them.

  • Using appropriate tidyr and dplyr methods, derive the proportion of functional and non-functional water point at LGA level.

  • Combining the geospatial and aspatial data frame into simple feature data frame.

  • Performing outliers/clusters analysis by using appropriate local measures of spatial association methods.

  • Performing hotspot areas analysis by using appropriate local measures of spatial association methods.

Thematic Mapping

  • Plot maps to show the spatial distribution of functional and non-functional water point rate at Local Government Area (LGA) level by using appropriate thematic mapping technique provided by tmap package.

Analytical Mapping

  • Plot hotspot areas and outliers/clusters maps of functional and non-functional water point rate at LGA level by using appropriate thematic mapping technique provided by tmap package.

Overview

Geospatial analytics hold tremendous potential to address complex problems faced by society. In this study, we are tasked to apply appropriate global and local measures of spatial association techniques to reveals the spatial patterns of non-functional water points. For the purpose of this study, Nigeria will be used as the study country.

Installing & Loading R Packages

In the code chunk below, p_load() of pacman package is used to install and load the following R packages into R environment:

  • sf

  • tidyverse

  • tmap

  • spdep

  • funModeling, to be used for rapid Exploratory Data Analysis

pacman::p_load(sf, tidyverse, tmap, spdep, funModeling)

The Data

Aspatial data

For the purpose of this exercise, data from WPdx Global Data Repositories will be used. There are two versions of the data. They are: WPdx-Basic and WPdx+. We are required to use WPdx+ data set.

Geospatial data

Nigeria Level-2 Administrative Boundary (also known as Local Government Area) polygon features GIS data will be used in this exercise. The data can be downloaded either from The Humanitarian Data Exchange portal or geoBoundaries.

Importing Geospatial Data

Two geospatial data sets used are:

  • geo_export

  • nga_admbnda_adm2_osgof_20190417

Importing water point geospatial data

First, we are going to import the water point geospatial data (i.e. geo_export) by using the code chunk below.

(Since we have previously used this data set in the in-class exercise, we will use the data directly from there.)

wp <- st_read(dsn = "C:/Jacobche/ISSS624/In-class_Ex/rawdata",
              layer = "geo_export",
              crs = 4326) %>%
  filter(clean_coun == "Nigeria")

Things to learn from the code chunk above:

  • st_read() of sf package is used to import geo_export shapefile into R environment and save the imported geospatial data into simple feature data table.

  • filter() of dplyr package is used to extract water point records of Nigeria only.

Note: Avoid performing transformation if you plan to use st_intersects() of sf package in the later stage of the geoprocessing. This is because st_intersects() only works correctly if the geospatial data are in geographic coordinate system (i.e wgs84).

Next, write_rds() of readr package is used to save the extracted sf data table (i.e. wp) into an output file in rds data format. The output file is called wp_nga.rds and it is saved in rawdata sub-folder, which will not be uploaded to Git.

wp_nga <- write_rds(wp,
                    "C:/Jacobche/ISSS624/In-class_Ex/rawdata/wp_nga.rds")

Importing Nigeria LGA boundary data

Now, we are going to import the LGA boundary data into R environment by using the code chunk below.

nga <- st_read(dsn = "C:/Jacobche/ISSS624/In-class_Ex/data",
               layer = "nga_admbnda_adm2_osgof_20190417",
               crs = 4326)
Reading layer `nga_admbnda_adm2_osgof_20190417' from data source 
  `C:\Jacobche\ISSS624\In-class_Ex\data' using driver `ESRI Shapefile'
Simple feature collection with 774 features and 16 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 2.668534 ymin: 4.273007 xmax: 14.67882 ymax: 13.89442
Geodetic CRS:  WGS 84

Thing to learn from the code chunk above.

  • st_read() of sf package is used to import nga_admbnda_adm2_osgof_20190417 shapefile into R environment and save the imported geospatial data into simple feature data table.

Data Wrangling

Recoding NA values into string

In the code chunk below, replace_na() is used to recode all the NA values in status_cle field into Unknown.

wp_nga <- read_rds("C:/Jacobche/ISSS624/In-class_Ex/rawdata/wp_nga.rds") %>%
  mutate(status_cle = replace_na(status_cle, "Unknown"))

Exploratory Data Analysis

In the code chunk below, freq() of funModeling package is used to display the distribution of status_cle field in wp_nga.

freq(data=wp_nga, 
     input = 'status_cle')

                        status_cle frequency percentage cumulative_perc
1                       Functional     45883      48.29           48.29
2                   Non-Functional     29385      30.93           79.22
3                          Unknown     10656      11.22           90.44
4      Functional but needs repair      4579       4.82           95.26
5 Non-Functional due to dry season      2403       2.53           97.79
6        Functional but not in use      1686       1.77           99.56
7         Abandoned/Decommissioned       234       0.25           99.81
8                        Abandoned       175       0.18           99.99
9 Non functional due to dry season         7       0.01          100.00

Extracting Water Point Data

In this section, we will extract the water point records by using classes in status_cle field.

Extracting functional water point

In the code chunk below, filter() of dplyr is used to select functional water points.

wpt_functional <- wp_nga %>%
  filter(status_cle %in%
           c("Functional",
             "Functional but not in use",
             "Functional but needs repair"))

Exploratory Data Analysis (functional)

In the code chunk below, freq() of funModeling package is used to display the distribution of status_cle field in wpt_functional.

freq(data=wpt_functional, 
     input = 'status_cle')

                   status_cle frequency percentage cumulative_perc
1                  Functional     45883      87.99           87.99
2 Functional but needs repair      4579       8.78           96.77
3   Functional but not in use      1686       3.23          100.00

Extracting non-functional water point

In the code chunk below, filter() of dplyr is used to select non-functional water points.

wpt_nonfunctional <- wp_nga %>%
  filter(status_cle %in%
           c("Abandoned/Decommissioned",
             "Abandoned",
             "Non-Functional",
             "Non functional due to dry season",
             "Non-Functional due to dry season"))

Exploratory Data Analysis (non-functional)

In the code chunk below, freq() of funModeling package is used to display the distribution of status_cle field in wpt_nonfunctional.

freq(data=wpt_nonfunctional, 
     input = 'status_cle')

                        status_cle frequency percentage cumulative_perc
1                   Non-Functional     29385      91.25           91.25
2 Non-Functional due to dry season      2403       7.46           98.71
3         Abandoned/Decommissioned       234       0.73           99.44
4                        Abandoned       175       0.54           99.98
5 Non functional due to dry season         7       0.02          100.00

Extracting water point with Unknown class

In the code chunk below, filter() of dplyr is used to select water points with unknown status.

wpt_unknown <- wp_nga %>%
  filter(status_cle == "Unknown")

Performing Point-in-Polygon Count

The code chunk below performs two operations at one go. Firstly, identify water points located inside each LGA by using st_intersects(). Next, length() of Base R is used to calculate numbers of water points that fall inside each LGA.

nga_wp <- nga %>% 
  mutate(`total wpt` = lengths(
    st_intersects(nga, wp_nga))) %>%
  mutate(`wpt functional` = lengths(
    st_intersects(nga, wpt_functional))) %>%
  mutate(`wpt non-functional` = lengths(
    st_intersects(nga, wpt_nonfunctional))) %>%
  mutate(`wpt unknown` = lengths(
    st_intersects(nga, wpt_unknown)))

Saving the Analytical Data Table

The code chunk below computes the proportion of functional and non-functional water point at LGA level.

nga_wp <- nga_wp %>%
  mutate(pct_functional = `wpt functional`/`total wpt`) %>%
  mutate(`pct_non-functional` = `wpt non-functional`/`total wpt`) %>%
  select(3:4, 9:10, 18:23)

Things to learn from the code chunk above:

  • mutate() of dplyr package is used to derive two fields namely pct_functional and pct_non-functional.

  • to keep the file size small, select() of dplyr is used to retain only fields 3, 4, 9, 10, 18, 19, 20, 21, 22 and 23.

Now, we have the tidy sf data table for subsequent analysis. We will save the sf data table into rds format.

write_rds(nga_wp, "C:/Jacobche/ISSS624/In-class_Ex/data/nga_wp.rds")

Visualising the spatial distribution of water points

The code below uses qtm() of tmap package to plot side-by-side choropleth maps showing the spatial water points distribution by LGA levels in Nigeria.

nga_wp <- read_rds("C:/Jacobche/ISSS624/In-class_Ex/data/nga_wp.rds")
total <- qtm(nga_wp, "total wpt") +
  tm_layout(scale = 0.7)
wp_functional <- qtm(nga_wp, "wpt functional")+
  tm_layout(scale = 0.7)
wp_nonfunctional <- qtm(nga_wp, "wpt non-functional")+
  tm_layout(scale = 0.6)
unknown <- qtm(nga_wp, "wpt unknown")+
  tm_layout(scale = 0.7)

tmap_arrange(total, wp_functional, wp_nonfunctional, unknown, nrow=2, ncol=2)

Next we will create an interactive choropleth map for non-functional water points which would allow us to zoom in for a closer look.

tmap_mode("view")

tm_shape(nga_wp) + 
  tm_polygons("wpt non-functional", 
              breaks = c(0, 71, 141, 211, 280),
              palette = "Reds") +
  tm_layout(title= "Spatial Distribution of Non-functional Water Points") +
  tm_scale_bar()
tmap_mode("plot")

From the map, we can see that the distribution of non-functional water points is not even with LGAs like Ifelodun and Igabi having a higher concentration than others. Nevertheless, there seem to be areas where they are clustered - i.e. around the Central and Western region of Nigeria.

Global Spatial Autocorrelation

In order to confirm our observation of signs of spatial clustering, we will make use of global autocorrection technique. We will compute the global spatial autocorrelation statistics and perform spatial complete randomness test for global spatial autocorrelation.

Computing Contiguity Spatial Weights

Before we can compute the global spatial autocorrelation statistics, we need to construct a spatial weights of the study area. The spatial weights is used to define the neighbourhood relationships between the geographical units (i.e. LGA) in the study area.

The code chunk below uses poly2nb() of spdep package to compute the Queen contiguity weight matrix for Nigeria.

wm_q <- poly2nb(nga_wp, 
                queen=TRUE)

set.ZeroPolicyOption(TRUE)
[1] FALSE
summary(wm_q)
Neighbour list object:
Number of regions: 774 
Number of nonzero links: 4440 
Percentage nonzero weights: 0.7411414 
Average number of links: 5.736434 
1 region with no links:
86
Link number distribution:

  0   1   2   3   4   5   6   7   8   9  10  11  12  14 
  1   2  14  57 125 182 140 122  72  41  12   4   1   1 
2 least connected regions:
138 560 with 1 link
1 most connected region:
508 with 14 links

The summary report above shows that there are 774 LGAs in Nigeria. The most connected LGA has 14 neighbours. There are two LGAs with only one neighbours.

Row-standardised weights matrix

Next, we need to assign weights to each neighboring polygon. In our case, each neighboring polygon will be assigned equal weight (style=“W”).

rswm_q <- nb2listw(wm_q, 
                   style="W", 
                   zero.policy = TRUE)
rswm_q
Characteristics of weights list object:
Neighbour list object:
Number of regions: 774 
Number of nonzero links: 4440 
Percentage nonzero weights: 0.7411414 
Average number of links: 5.736434 
1 region with no links:
86

Weights style: W 
Weights constants summary:
    n     nn  S0       S1       S2
W 773 597529 773 285.0658 3198.414

Global Spatial Autocorrelation: Moran’s I

Maron’s I test

The code chunk below performs Moran’s I statistical testing using moran.test() of spdep.

moran.test(nga_wp$`wpt non-functional`,
           listw=rswm_q, 
           zero.policy = TRUE, 
           na.action=na.omit)

    Moran I test under randomisation

data:  nga_wp$`wpt non-functional`  
weights: rswm_q  n reduced by no-neighbour observations
  

Moran I statistic standard deviate = 20.043, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
      0.433932927      -0.001295337       0.000471516 

From the Moran’s I test since the p-value < 2.2e-16 which is approximately 0, we can reject the null hypothesis at 99% confidence interval and can conclude that the distribution of non-functional water points in the LGAs are not randomly distributed. As the Moran I statistic = 0.433932927 > 0, we can infer that there is sign of “clustered” spatial pattern.

Computing Monte Carlo Moran’s I

The code chunk below performs permutation test for Moran’s I statistic by using moran.mc() of spdep. A total of 1000 simulation will be performed.

set.seed(1234)
bperm= moran.mc(nga_wp$`wpt non-functional`,
                listw=rswm_q, 
                nsim=999, 
                zero.policy = TRUE, 
                na.action=na.omit)
bperm

    Monte-Carlo simulation of Moran I

data:  nga_wp$`wpt non-functional` 
weights: rswm_q  
number of simulations + 1: 1000 

statistic = 0.43393, observed rank = 1000, p-value = 0.001
alternative hypothesis: greater

From the Monte Carlo simulation since the p-value = 0.001, we can reject the null hypothesis at 99% confidence interval and can conclude that the distribution of non-functional water points in the LGAs are not randomly distributed. As the Moran I statistic = 0.43393 > 0, we can infer that there is sign of “clustered” spatial pattern.

Global Spatial Autocorrelation: Geary’s

Geary’s C test

The code chunk below performs Geary’s C test for spatial autocorrelation by using geary.test() of spdep.

geary.test(nga_wp$`wpt non-functional`, listw=rswm_q)

    Geary C test under randomisation

data:  nga_wp$`wpt non-functional` 
weights: rswm_q 

Geary C statistic standard deviate = 14.457, p-value < 2.2e-16
alternative hypothesis: Expectation greater than statistic
sample estimates:
Geary C statistic       Expectation          Variance 
     0.6170907765      1.0000000000      0.0007014859 

From the Geary’s C test since the p-value < 2.2e-16 which is approximately 0, we can reject the null hypothesis at 99% confidence interval and can conclude that the distribution of non-functional water points in the LGAs are not randomly distributed. As the Geary C statistic = 0.6170907765 < 1, we can again infer that there is sign of “clustered” spatial pattern.

Computing Monte Carlo Geary’s C

The code chunk below performs permutation test for Geary’s C statistic by using geary.mc() of spdep.

set.seed(1234)
bperm=geary.mc(nga_wp$`wpt non-functional`, 
               listw=rswm_q, 
               nsim=999)
bperm

    Monte-Carlo simulation of Geary C

data:  nga_wp$`wpt non-functional` 
weights: rswm_q 
number of simulations + 1: 1000 

statistic = 0.61709, observed rank = 1, p-value = 0.001
alternative hypothesis: greater

From the Monte Carlo simulation since the p-value = 0.001, we can reject the null hypothesis at 99% confidence interval and can conclude that the distribution of non-functional water points in the LGAs are not randomly distributed. As the Geary C statistic = 0.61709 < 1, we can again infer that there is sign of “clustered” spatial pattern.

Cluster and Outlier Analysis